Rethinking Learning Rate and Batch Size (Part 4): EMA Effects

In "Rethinking Learning Rate and Batch Size (Part 2): Mean Field Theory", we mentioned that one reason for focusing on SignSGD is that we typically use it as a theoretical approximation for Adam—a common simplification strategy when analyzing Adam theoretically. Beyond learning rate analysis, we have also employed this simplification in contexts such as "Configuring Different Learning Rates: Can LoRA Improve Further?" and "Exploring MuP: Cross-Model Scale Transfer of Hyperparameters".

However, is SignSGD truly a good approximation for Adam? An obvious discrepancy is that SignSGD's Update RMS is always 1, whereas Adam's is not. I discovered that the core reason for this difference is momentum, which is ubiquitous in optimizers like Adam, Lion, and Muon. Therefore, in this article, we examine the effects of momentum—more broadly, Exponential Moving Averages (EMA).

Problem Analysis#

From the perspective of Adam, SignSGD corresponds to the special case where $\beta_1=\beta_2=0$, or corresponds to Adam's first update step (regardless of $\beta_1,\beta_2$). Therefore, we believe it must share some commonalities with Adam and can capture some general patterns.

Key Differences Between SignSGD and Adam

Update RMS Discrepancy: SignSGD always has an Update RMS of 1, while Adam's is typically significantly less than 1.

Behavioral Similarity: Adam appears closer to SGD—it seems like an intermediate version between SignSGD and SGD.

Initial Hypothesis: Initially, I thought this difference was caused by $\epsilon$ in Adam's denominator, so I specifically calculated SoftSignSGD with $\epsilon$ in "How Does Adam's Epsilon Affect Learning Rate Scaling Laws?".

Later Realization: In "Why is Adam's Update RMS 0.2?", we estimated Adam's Update RMS through both simulation and theory. The mean field approximation estimate was $\sqrt{\frac{1-\beta_1}{1+\beta_1}}$, which was verified to align well with both simulation results and actual experiments. This result explicitly depends on $\beta_1$, clearly directing our thinking toward momentum.

This leads to the following analytical process. In summary, we can confirm that the role of $\epsilon$ is indeed secondary; the true protagonist is actually momentum—which is essentially a "moving average" of gradients—and this is precisely the main focus of this article: "EMA (Exponential Moving Average)".

Gradient Descent#

To analyze the variations introduced by EMA, we start with SGDM (SGD with momentum). In practice, when using SGD, it is rare not to include momentum:

(1) \[ \begin{aligned} &\boldsymbol{m}_t = \beta_1 \boldsymbol{m}_{t-1} + \left(1 - \beta_1\right) \boldsymbol{g}_t \\[4pt] &\boldsymbol{w}_t = \boldsymbol{w}_{t-1} - \eta_t \boldsymbol{m}_t \end{aligned} \]

In practical use, $\boldsymbol{g}_t$ is replaced by $\tilde{\boldsymbol{g}}_{B,t}$, a random variable with mean $\boldsymbol{g}_t$ and covariance matrix $\boldsymbol{\Sigma}_t/B$. These basic settings are the same as in "Rethinking Learning Rate and Batch Size (Part 1): Current Landscape". The noise here arises from randomly sampling different batches, so we can reasonably assume that $\tilde{\boldsymbol{g}}_{B,t}$ across different $t$ are mutually independent.

Our task is to compute:

(2) \[ \newcommand{tr}{\mathop{\text{tr}}}\eta^* \approx \frac{\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]^{\top}\boldsymbol{g}}{\tr(\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]\boldsymbol{H})} \]

The related derivations have been provided in previous articles and won't be repeated here. For SGDM, $\tilde{\boldsymbol{\varphi}}_B = \boldsymbol{m}_t$, which can be expanded as:

(3) \[ \boldsymbol{m}_t = (1 - \beta_1)\sum\limits_{s=1}^t \beta_1^{t-s}\tilde{\boldsymbol{g}}_{B,s} \]

Batch Size Amplification#

Now we can compute:

(4) \[ \mathbb{E}[\boldsymbol{m}_t] = (1 - \beta_1)\sum_{s=1}^t \beta_1^{t-s}\mathbb{E}[\tilde{\boldsymbol{g}}_{B,s}] = (1 - \beta_1)\sum_{s=1}^t \beta_1^{t-s}\boldsymbol{g}_s \]

We further assume that when model training enters a "steady state," the gradient changes slowly. Thus, we can approximate $\boldsymbol{g}_s$ with the current gradient $\boldsymbol{g}_t$, yielding:

(5) \[ \mathbb{E}[\boldsymbol{m}_t] = (1 - \beta_1)\sum_{s=1}^t \beta_1^{t-s}\boldsymbol{g}_t = (1 - \beta_1^t) \boldsymbol{g}_t \approx \boldsymbol{g}_t \qquad (t\to\infty) \]

As for $\mathbb{E}[\boldsymbol{m}_t \boldsymbol{m}_t^{\top}]$, we use the identity $\mathbb{E}[\boldsymbol{m}_t \boldsymbol{m}_t^{\top}] = \mathbb{E}[\boldsymbol{m}_t] \mathbb{E}[\boldsymbol{m}_t]^{\top} + \mathbb{C}\text{ov}[\boldsymbol{m}_t,\boldsymbol{m}_t]$, then utilize the additivity of variance:

(6) \[ \mathbb{C}\text{ov}[\boldsymbol{m}_t,\boldsymbol{m}_t] = (1 - \beta_1)^2\sum_{s=1}^t \beta_1^{2(t-s)}\boldsymbol{\Sigma}_s/B \]

Similarly, assuming slow variation of the covariance matrix, we have:

(7) \[ \mathbb{C}\text{ov}[\boldsymbol{m}_t] \approx (1 - \beta_1)^2\sum_{s=1}^t \beta_1^{2(t-s)}\boldsymbol{\Sigma}_t/B = (1 - \beta_1)^2\frac{1-\beta_1^{2t}}{1-\beta_1^2}\boldsymbol{\Sigma}_t/B = \frac{1 - \beta_1}{1 + \beta_1}\boldsymbol{\Sigma}_t/B \qquad (t\to\infty) \]

Substituting into equation (2) yields:

(8) \[ \eta^* \approx \frac{\eta_{\max}}{1 + \frac{1 - \beta_1}{1 + \beta_1}\mathcal{B}_{\text{noise}}/B},\qquad \eta_{\max} = \frac{\boldsymbol{g}^{\top}\boldsymbol{g}}{\boldsymbol{g}^{\top}\boldsymbol{H}\boldsymbol{g}},\quad\mathcal{B}_{\text{noise}} = \frac{\tr(\boldsymbol{\Sigma}\boldsymbol{H})}{\boldsymbol{g}^{\top}\boldsymbol{H}\boldsymbol{g}} \]

Key Insight: Momentum as Batch Size Amplification

From this result, we can see that introducing the momentum mechanism effectively amplifies SGD's batch size by a factor of $\frac{1 + \beta_1}{1 - \beta_1}$. According to my understanding, momentum reduces gradient noise by applying EMA to gradients along the optimization trajectory at low cost, so this result aligns with my interpretation of momentum's purpose.

Signed Momentum#

Further, we consider SignSGDM, which can be viewed as a special case of Lion—essentially SGDM with an added $\newcommand{sign}{\mathop{\text{sign}}}\sign$:

(9) \[ \begin{aligned} &\boldsymbol{m}_t = \beta_1 \boldsymbol{m}_{t-1} + \left(1 - \beta_1\right) \boldsymbol{g}_t \\[4pt] &\boldsymbol{w}_t = \boldsymbol{w}_{t-1} - \eta_t \sign(\boldsymbol{m}_t) \end{aligned} \]

In actual training, $\boldsymbol{g}_t$ is similarly replaced by $\tilde{\boldsymbol{g}}_{B,t}$. For SignSGDM, $\tilde{\boldsymbol{\varphi}}_B = \sign(\boldsymbol{m}_t)$. Using the mean field approximation:

(10) \[ \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B] = \mathbb{E}\bigg[\frac{\boldsymbol{m}_t}{\sqrt{\boldsymbol{m}_t^2}}\bigg]\approx \frac{\mathbb{E}[\boldsymbol{m}_t]}{\sqrt{\mathbb{E}[\boldsymbol{m}_t^2]}} \]

Here, vector multiplication defaults to the Hadamard product. The numerator $\mathbb{E}[\boldsymbol{m}_t]$ was computed in the previous section. The denominator $\mathbb{E}[\boldsymbol{m}_t^2]$ actually equals $\newcommand{diag}{\mathop{\text{diag}}}\diag(\mathbb{E}[\boldsymbol{m}_t \boldsymbol{m}_t^{\top}])$, so we can substitute the results from the previous section to obtain:

(11) \[ \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B] \approx \frac{\boldsymbol{g}_t}{\sqrt{\boldsymbol{g}_t^2 + \frac{1 - \beta_1}{1 + \beta_1}\boldsymbol{\sigma}_t^2/B}} = \frac{\sign(\boldsymbol{g}_t)}{\sqrt{1 + \frac{1 - \beta_1}{1 + \beta_1}(\boldsymbol{\sigma}_t^2/\boldsymbol{g}_t^2)/B}} \approx \frac{\sign(\boldsymbol{g}_t)}{\sqrt{1 + \frac{1 - \beta_1}{1 + \beta_1} \mathcal{B}_{\text{simple}}/B}} \]

where $\boldsymbol{\sigma}_t^2 = \diag(\boldsymbol{\Sigma}_t)$, and $\mathcal{B}_{\text{simple}} = \tr(\boldsymbol{\Sigma}_t)/\boldsymbol{g}_t^{\top}\boldsymbol{g}_t$. The above equation is equivalent to SignSGD with $B$ replaced by $\frac{1 + \beta_1}{1 - \beta_1}B$. If we further compute $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B\tilde{\boldsymbol{\varphi}}_B^{\top}]$, we would find the same conclusion. Thus, similar to SGDM, momentum effectively amplifies SignSGD's batch size by a factor of $\frac{1 + \beta_1}{1 - \beta_1}$.

In "Rethinking Learning Rate and Batch Size (Part 3): Muon", we calculated Muon's learning rate scaling pattern and found it to be consistent with SignSGD. Therefore, we can assert that the role of momentum in Muon is similar to that in SignSGDM—both approximately amplify the batch size by a factor of $\frac{1 + \beta_1}{1 - \beta_1}$.

Double Smoothing#

Finally, we examine Adam:

(12) \[ \begin{aligned} &\boldsymbol{m}_t = \beta_1 \boldsymbol{m}_{t-1} + \left(1 - \beta_1\right) \boldsymbol{g}_t\\ &\boldsymbol{v}_t = \beta_2 \boldsymbol{v}_{t-1} + \left(1 - \beta_2\right) \boldsymbol{g}_t^2\\ &\hat{\boldsymbol{m}}_t = \boldsymbol{m}_t\left/\left(1 - \beta_1^t\right)\right.\\ &\hat{\boldsymbol{v}}_t = \boldsymbol{v}_t\left/\left(1 - \beta_2^t\right)\right.\\ &\boldsymbol{\theta}_t = \boldsymbol{\theta}_{t-1} - \eta_t \hat{\boldsymbol{m}}_t\left/\left(\sqrt{\hat{\boldsymbol{v}}_t} + \epsilon\right)\right. \end{aligned} \]

In actual training, $\boldsymbol{g}_t$ is replaced by $\tilde{\boldsymbol{g}}_{B,t}$. We consider the state where training has already entered a "steady state," i.e., $t\to\infty$, so we do not distinguish between $\boldsymbol{m}_t$ and $\hat{\boldsymbol{m}}_t$, or between $\boldsymbol{v}_t$ and $\hat{\boldsymbol{v}}_t$. Additionally, to focus on the effect of EMA, we set $\epsilon = 0$. Then for Adam, $\tilde{\boldsymbol{\varphi}}_B=\boldsymbol{m}_t/\sqrt{\boldsymbol{v}_t}$. The difference from SignSGDM is that the denominator $\boldsymbol{m}_t^2$ is replaced by another EMA statistic $\boldsymbol{v}_t$.

Using the mean field approximation:

(13) \[ \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B] = \mathbb{E}\bigg[\frac{\boldsymbol{m}_t}{\sqrt{\boldsymbol{v}_t}}\bigg]\approx \frac{\mathbb{E}[\boldsymbol{m}_t]}{\sqrt{\mathbb{E}[\boldsymbol{v}_t]}} \]

We have already computed $\mathbb{E}[\boldsymbol{m}_t]$, so we only need to compute $\mathbb{E}[\boldsymbol{v}_t]$:

(14) \[ \mathbb{E}[\boldsymbol{v}_t] = (1 - \beta_2)\sum_{s=1}^t \beta_2^{t-s}\mathbb{E}[\tilde{\boldsymbol{g}}_{B,s}^2] = (1 - \beta_2)\sum_{s=1}^t \beta_2^{t-s}(\boldsymbol{g}_s^2 + \boldsymbol{\sigma}_s^2/B)\approx \boldsymbol{g}_t^2 + \boldsymbol{\sigma}_t^2/B \]

As before, the last approximation assumes slow variation of gradients and variances, and $t\to\infty$. Thus, we have:

(15) \[ \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B] \approx \frac{\boldsymbol{g}_t}{\sqrt{\boldsymbol{g}_t^2 + \boldsymbol{\sigma}_t^2/B}} \approx \frac{\sign(\boldsymbol{g}_t)}{\sqrt{1 + \mathcal{B}_{\text{simple}}/B}} \]

This result is indeed identical to SignSGD, so from the perspective of the first moment, SignSGD as an approximation for Adam is reasonable. However, we also have the second moment $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B \tilde{\boldsymbol{\varphi}}_B^{\top}]$. Under the assumption of component independence, we only need to compute $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B^2]$:

(16) \[ \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B^2] = \mathbb{E}\bigg[\frac{\boldsymbol{m}_t^2}{\boldsymbol{v}_t}\bigg]\approx \frac{\mathbb{E}[\boldsymbol{m}_t^2]}{\mathbb{E}[\boldsymbol{v}_t]} \approx \frac{\boldsymbol{g}_t^2 + \frac{1 - \beta_1}{1 + \beta_1}\boldsymbol{\sigma}_t^2/B}{\boldsymbol{g}_t^2 + \boldsymbol{\sigma}_t^2/B} \]

Two Special Cases#

We examine two special cases. First, when $\beta_1=0$, the numerator and denominator are identical, making $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B^2]$ a vector of all ones—consistent with SignSGD. Therefore, SignSGD is a good approximation for Adam with $\beta_1=0$—which is RMSProp. As $\beta_1$ increases, the approximation quality begins to degrade.

When $\beta_1=1$, we have:

(17) \[ \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B^2] \approx \frac{\boldsymbol{g}_t^2}{\boldsymbol{g}_t^2 + \boldsymbol{\sigma}_t^2/B}\approx \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]^2 \]

This yields $\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B \tilde{\boldsymbol{\varphi}}_B^{\top}] \approx \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B] \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]^{\top}$. Substituting into equation (2):

(18) \[ \eta^* \approx \frac{\Vert \boldsymbol{g}\Vert_1 \sqrt{1 + \mathcal{B}_{\text{simple}}/B}}{\sign(\boldsymbol{g})^{\top} \boldsymbol{H} \sign(\boldsymbol{g})} \]

Critical Implication: Surge Phenomenon Acceleration

Note that this is a monotonically decreasing function with respect to $B$, meaning that when batch size increases, the learning rate should decrease. From this, we can infer that increasing $\beta_1$ in Adam will accelerate the emergence of the "Surge phenomenon."

This conclusion might seem somewhat puzzling at first, but it becomes understandable from another perspective. The "Surge phenomenon" refers to the situation where, after batch size exceeds a certain threshold, the optimal learning rate decreases as batch size increases. The previous results for SGDM and SignSGDM indicate that introducing momentum effectively amplifies batch size by a factor of $\frac{1 + \beta_1}{1 - \beta_1} > 1$, which naturally increases the likelihood of exceeding the threshold.

In other words, the conclusion that "as $\beta_1$ increases, the 'Surge phenomenon' becomes more likely to occur" holds even for SignSGDM. While Adam has some new characteristics compared to SignSGDM, the fundamental point that "the momentum mechanism effectively amplifies batch size" remains valid, making the same conclusion understandable.

General Analysis#

Let's rewrite equation (16):

(19) \[ \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B^2] \approx \frac{\boldsymbol{g}_t^2 + \frac{1 - \beta_1}{1 + \beta_1}\boldsymbol{\sigma}_t^2/B}{\boldsymbol{g}_t^2 + \boldsymbol{\sigma}_t^2/B} = \frac{2\beta_1}{1+\beta_1}\frac{\boldsymbol{g}_t^2}{\boldsymbol{g}_t^2 + \boldsymbol{\sigma}_t^2/B} + \frac{1 - \beta_1}{1 + \beta_1} \approx \frac{2\beta_1}{1+\beta_1}\mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]^2 + \frac{1 - \beta_1}{1 + \beta_1} \]

From this, we can write:

(20) \[ \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B \tilde{\boldsymbol{\varphi}}_B^{\top}] \approx \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B] \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]^{\top} + \frac{1 - \beta_1}{1 + \beta_1}\diag\left(1 - \mathbb{E}[\tilde{\boldsymbol{\varphi}}_B]^2\right) \]

Then:

(21) \[ \eta^* \approx \frac{\sum_i |g_i|}{\frac{1}{\beta}\frac{1 - \beta_1}{1 + \beta_1}\sum_i H_{i,i} + \beta\left(\sum_{i,j} H_{i,j}\sign(g_i g_j) - \frac{1 - \beta_1}{1 + \beta_1}\sum_i H_{i,i}\right)} \]

Here, the subscriptless $\beta$ equals $(1 + \mathcal{B}_{\text{simple}}/B)^{-1/2}$. Without careful reading, this might be confused with $\beta_1,\beta_2$—I apologize, as this is notation from the previous two articles and we continue using it here. Unlike SignSGD, where assuming a diagonal Hessian matrix eliminates the Surge phenomenon, the above expression still exhibits the Surge phenomenon even under a diagonal Hessian assumption. In that case:

(22) \[ \eta^* \approx \frac{\sum_i |g_i|}{\left(\frac{1}{\beta}\frac{1 - \beta_1}{1 + \beta_1} + \beta\frac{2\beta_1}{1 + \beta_1}\right)\sum_i H_{i,i}} \]

By the inequality of arithmetic and geometric means, the above expression reaches its maximum at $\beta^*=\sqrt{\frac{1-\beta_1}{2\beta_1}}$. However, note that by definition $\beta\in(0,1)$, so we must also check whether $\beta^*\in(0,1)$, i.e., $\beta_1 > 1/3$. If this condition is not satisfied, the maximum still occurs at $\beta=1$, and there is no Surge phenomenon. Conversely, when $\beta_1 > 1/3$ and $\beta > \beta^*$ (i.e., $B > \frac{1-\beta_1}{3\beta_1-1}\mathcal{B}_{\text{simple}}$), the learning rate should decrease as batch size increases.

Explaining Muon's Superior Batch Size Performance

This conclusion can preliminarily explain why Muon supports larger batch sizes. As shown in "Rethinking Learning Rate and Batch Size (Part 3): Muon", Muon's behavior is similar to SignSGDM. Under specific Hessian structure assumptions, it does not exhibit the Surge phenomenon, meaning that increasing batch size can always improve learning efficiency, although the relative gains diminish.

In contrast, Adam under common settings (e.g., $\beta_1=0.9$), even assuming a diagonal Hessian, exhibits the Surge phenomenon. This means that once batch size exceeds a certain value, learning efficiency declines.

Summary#

This article provides a preliminary analysis of how the EMA mechanism in optimizers affects learning rate and batch size scaling laws. It confirms that EMA, particularly the momentum mechanism, subtly alters scaling laws. Optimizers like Adam, which involve dual EMA operations, exhibit some new characteristics distinct from SignSGD.

Citation Information

Original Article: Su Jianlin. Rethinking Learning Rate and Batch Size (Part 4): EMA Effects. Scientific Spaces.

How to cite this translation:

Su, J. Rethinking Learning Rate and Batch Size (Part 4): EMA Effects [Translated by Juanxi Tian]. Scientific Spaces.

BibTeX:

@article{su2025rethinking_lr_bs_part4, title = {Rethinking Learning Rate and Batch Size (Part 4): EMA Effects}, author = {Su, Jianlin}, journal = {Scientific Spaces}, year = {2025}, url = {https://kexue.fm/archives/11301}, note = {Translated by Juanxi Tian (ScalingOpt Team)} }